Advanced Graphics and Data Visualization in R test package installs

Lecture 01: “R”-efresher on R and best practices


0.1.0 An overview of Advanced Graphics and Data Visualization in R

“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. Many of the datasets and examples used in this course will be drawn from real-world datasets and the techniques learned herein aim to be broadly applicable to multiple fields.

This lesson is the first in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.


0.2.0 Lecture objectives

This week will be your crash-course on R markdown notebooks and R to refresh on packages and principles that will be relevant throughout our course. In our lectures and your assignments we will be working with some uncurated data to simulate the full experience of working with data from start to finish. It’s important that we are all familiar with, and understand the majority of the tidy data methods that we’ll be using in class so that we can focus on the new material as it appears. We’ll use some standard packages and practices to finesse our data before visualizing it, so let’s R-efresh ourselves.

At the end of this lecture we will have covered the following topics:

  1. Working with R markdown notebooks and best coding practices.
  2. R data types, objects and working with them.
  3. Long-format and tidy data principles using the tidyverse package.
  4. Basic control flow and plotting.

0.3.0 A legend for text format in R markdown

grey background - a package, function, code, command or directory. Backticks are also use for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced

Yellow box: Risk or caution

Green boxes: Recommended reads and resources to learn R

Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.


0.4.0 Lecture and data files used in this course

0.4.1 Weekly Lecture and skeleton files

Each week, new lesson files will appear within your RStudio folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto datatools Hub. You will need to use your UTORid credentials to complete the login process. From there you will find each week’s lecture files in the directory /2024-03-Adv_Graphics_R/Lecture_XX. You will find a partially coded skeleton.Rmd file as well as all of the data files necessary to run the week’s lecture.

Alternatively, you can download the R-Markdown Notebook (.Rmd) and data files from the RStudio server to your personal computer if you would like to run independently of the Toronto tools.

0.4.2 Live-coding HTML page

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture PDFs

At the end of each lecture there will be a completed version of the lecture code released as an HTML file under the Modules section of Quercus.

0.4.4 Data used in this lesson

Today’s datasets will focus on using the Ontario public sector salary disclosure, also known as the “Sunshine list”. This list, started in 1996, publishes all public sector servants that with an annual salary at or above $100,000. Although not strictly biological data, this is a great dataset to work with because it contains many observations set across a long time period with enough data to help generate subgroups based on sector, employer, and role!

You can find more information about this dataset on the Ontario public sector salary disclosure webpage

0.4.4.1 Dataset 1: sunshineList_subset_numID_wide.tsv

This is a version of the Sunshine list ranging from 1996-2023. It has been lightly sanitized to reduce in size so the years covered are just about every 5 years. It has further been altered by replacing all names with random numeric identifiers. It is in a tab-separated format and contains nearly 500,000 observations.

0.4.4.2 Dataset 2: static_inflation_calculations.csv

This dataset is a table of the monthly inflation rate starting in January 1914, calculated as a static rate of increase. With the proper analysis, it can be used to compare the consumer price index across various timespans. This data was obtained from the Bank of Canada


0.5.0 Packages used in this lesson

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

magrittr will allow us to use a number of different piping/redirect options

viridis helps to create color-blind palettes for our data visualizations

Let’s run our first code cell!

# Packages to help tidy our data
library(tidyverse)
library(magrittr)

# Packages for the graphical analysis section
library(viridis)

1.0.0 Coding in R Markdown Notebooks

Work with your R markdown notebook on the University of Toronto datatools hub will all be contained within a new browser tab with the address bar showing something similar to

https://r.datatools.utoronto.ca/user/calvin.mok@utoronto.ca/rstudio/

All of this is running remotely on a University of Toronto server rather than your own machine.

You’ll see a directory structure from your home folder:

ie /home/rstudio/2024-03-Adv_Graphics_R/ and a folder to Lecture_01_R_Introduction within. Clicking on that, you’ll find Lecture_01.R-efresher.skeleton.Rmd which is the notebook we will use for today’s code-along lecture.


1.1.0 Why is this class using R Markdown Notebooks?

We’ve implemented the class this way to reduce the burden of having to install various programs. While installation can be a little tricky, it’s really not that bad. For this course, however, you don’t need to go through all of that just to improve on your data visualization skills.

R markdown notebooks also give us the option of inserting “markdown” text much like what you’re reading at this very exact moment. So we can intersperse ideas and information between our learning code blocks.

There is, however an appendix section at the end of this lecture detailing how to install the R-kernel itself and the integrated development environment (IDE) called RStudio.


1.2.0 Packages contain useful functions that we’ll use often

So… what are in these packages? A package can be a collection of - functions - data objects - compiled code - functions that override base functions in R

Functions are the basic workhorses of R; they are the tools we use to analyze our data. Each function can be thought of as a unit that has a specific task. A function takes an input, evaluates it using an expression (e.g. a calculation, plot, merge, etc.), and returns an output (a single value, multiple values, a graphic, etc.).

In this course we will frequently rely on a package called the tidyverse which is also composed of a series of other packages we can use to reformat our data like readr, dplyr, tidyr and stringr.


1.3.0 R markdown notebooks run the programming language R

Behind the scenes of each markdown notebook the R kernel is running. As we move from code cell to new code cell, all of the variables or objects we have created are stored within memory. We can refer to these as we run the code and move forward but if you overwrite or change them by mistake, you may to have rerun multiple cell blocks!

There are some options in the “Code” menu that can alleviate these problems such as “Run Region > Run All Chunks Above”. If you think you’ve made a big error by overwriting a key object, you can use that option to “re-initialize” all of your previous code!

Unfortunately, the run order of your code is not tracked. When a code cell is still actively running you will see in the top-right corner of the Console window (lower pane) denoted as a STOP sign icon. Clicking on this will interrupt the kernel and stop code execition, although depending on the complexity of the code it may take a moment.

Remember these friendly keys/shortcuts:

  • Arrow keys to navigate up and down (and within a cell)
  • Ctrl+Shift+Enter to run a cell (both code and markdown)
  • Ctrl+Enter to run a single line of code within a code cell
  • Alt+Ctrl+Enter to run the next cell
  • Ctrl+Shift+C to quickly comment and uncomment single or multiple lines of code
  • Tab can be used while coding to autocomplete variable, function and file names, and even look at a list of possible parameters for functions.
  • Ctrl+Alt+I to insert a new coding cell

1.3.3 Why would you want to use a Markdown Notebook?

Depending on your needs, you may find yourself doing the following:

  • Analysing data for your project using available packages
  • Re-analysing data for your project
  • Analysing multiple datasets for your project
  • Collaborating on data and analyses for your project
  • Explaining your data and analyses to a supervisor or collaborator!

Markdown allows you to alternate between “markdown” notes and “code” that can be run or re-run on the fly.

Each data run and it’s results can be saved individually as a new notebook to compare data and small changes to analyses!

1.3.4 What is markdown language?

Markdown is a markup language that lets you write HTML and Java Script code in combination with other languages. This allows you to make html, pdf, and text documents that are combinations of text and code, enhancing reproducibility, a key aspect in scientific work. Having everything in a single place also boosts productivity during results interpretation - no need to go back and forth between tabs, pages, and documents. They can all be integrated in a single document, allowing for a more fluid narrative of the story that you are communicating to your audience (less distractions for you!). For example, the lines of code below and the text you are reading right now were created in R’s Markdown language. (Do not worry about the R code just yet. We will get there sooner than you think).

As mentioned, markdown also allows you to write in LaTeX, a document preparation system to write mathematical notation. All it takes is to wrap LaTeX code between single dollar signs ($) for inline notation or two double dollar signs ($$), one at the beginning of the equation and one at the end. For example, the equation Yi = beta0 + beta1 xi + epsilon_i, i=1, …, N can be transformed into LaTeX code by adding some characters: ***Y_i = _0 + _1 x_i + _i, i=1, , N***. Now, if we use $$ before and after the LaTeX code, this is what we get:

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1, \dots,N \]

See? Just like that! Here is an example of a table made in Markdown, showing some of the most popular R libraries for data science:

Library Use
tidyverse Simplified tabular-data processing functions
ggplot2 Data visualization package typically included in the tidyverse
shiny Used to create interactive R-based web pages and interfaces
car Popular statistical analysis with Type II and III ANOVA tables

These are just a few examples of what you can do with Jupyter and Markdown. To find out more on how to get the best of Markdown, head on over to the [R Markdown cookbook] (https://bookdown.org/yihui/rmarkdown-cookbook/).

Once you are finished writing your code and interpreting those results in a markdown notebook, you can render the notebook into pdf, html, and many other formats. There are several ways to achieve this. The easiest option is to go to File > Knit Document. Afterwards there should be an option to view in browser at which point you can save as an HTML or print it to PDF.


1.4.0 Following best practices for coding will make life easier

Let’s discuss some important behaviours before we begin coding:

  • Code annotation (commenting)

  • Variable naming conventions

  • Best practices

1.4.1 Annotate your code with the # symbol

Why bother?

  • “Can you rerun this analysis and but change X parameter?” - Anonymous PI

  • “Can you make this plot, but with dashed lines, a different axis, with error bars?” - Anonymous labmate

  • “Can I borrow your code?” - Anonymous collaborator or officemate or PI

  • “Why is that object being sent to that function? What is it returning?” - You, Me, and anyone reading your code

Your worst collaborator is potentially you in 6 days or 6 months. Do you remember what you had for breakfast last Tuesday?

Credit: https://www.testbytes.net/blog/programming-memes/

You can annotate your code for selfish reasons, or altruistic reasons, but annotate your code.

How do I start?

  • It is, in general, part of best coding practices to keep things tidy and organized.

  • A hash-tag # will comment your text. Inside a code cell in an R notebook or anywhere in an R script, all text after a hashtag will be ignored by R and by many other programming languages. It’s very useful to add comments about changes in your code, as well as detailed explanations about your scripts.

  • Put a description of what you are doing near your code at every process, decision point, or non-default argument in a function. For example, why you selected k=6 for an analysis, or the Spearman over Pearson option for your correlation matrix, or quantile over median normalization, or why you made the decision to filter out certain samples.

  • Break your code into sections to make it readable. Scripts are just a series of steps and major steps should be titled/outlined with your reasoning - much like when presenting your research.

  • Give your objects informative object names that are not the same as function names.

Comments may/should appear in three places:

  • At the beginning of your code: What’s the objective of your script?
  • Above every function you create: Why did you have to write your own function versus those that are already available in package x?
  • In-line or in-between lines of code: Why did you write that piece of code? What does it do? Why did you change a function’s defaults?
# Example commenting section
# At the beginning of the script, describing the purpose of your script and what you are trying to solve

bedmasAnswer <- 5 + 4 * 6 - 0 #In line: Describing a part of your code that is not obvious what it is for. 

#---------- Section dividers helps organize code structure ----------#
## Feel free to add extra hash tags to visually separate or emphasize comments

Maintaining well-documented code is also good for mental health!


1.4.2 Naming conventions for files, objects, and functions in R

  • Cannot start with a number
  • Cannot contain spaces or special characters in the name
  • Avoid naming your variables using names already used by R (for, next, while, etc.).
  • Consider appending the object type to your variable name (data frame = df, list = list, etc.)

Stylistically, you have the following options:

  • All lower case: e.g. myfirstobject
  • Period separated: e.g. my.first.object
  • Underscore separated: e.g. my_first_object
  • camelCase1: e.g. myFirstObject
  • CamelCase2: e.g. MyFirstObject (Usually reserved for Class names)

The most important aspects of naming conventions are being concise and consistent! Throughout this course you’ll see a hybrid system that uses the underscore to separate words but a period right before denoting the object type ie this_data.object.


1.4.3 Best Practices for Writing Scripts

  • Start each script with a description of what it does.

  • Then load all required packages.

  • Consider what working directory you are in when sourcing a script.

  • Use comments to mark off sections of code.

  • Put function definitions at the top of your file, or in a separate file if there are many.

  • Name and style code consistently.

  • Break code into small, discrete pieces.

    • This is more easily accomplished when working with different code cells in Jupyter Notebook.
  • Factor out common operations rather than repeating them.

  • Keep all of the source files for a project in one directory and use relative paths to access them.

    • Using relative paths close to or inside your script’s directory makes it easier to package or move around your scripts too.
  • Keep track of the memory used by your program.

  • Always start with a clean environment instead of saving the workspace.

  • Keep track of session information in your project folder.

  • Have someone else review your code.

  • Use version control.

For more information on best coding practices, please visit swcarpentry


1.5.0 Trouble-shooting basics

We all run into problems. We’ll see a lot of mistakes happen in class too! That’s OK if we can learn from our errors and quickly (or eventually) recover.

1.5.1 Determine the location and type of error

Usually when R generates an error it will produce some information about what has happened. This usually includes an error message detailing the kind of error it encountered or an error message generated by the function. It can also include a line where the error was encountered, or the name of the last function that was called before the error was encountered.

1.5.2 Common errors

  • file does not exist: Use getwd() to check where you are working, typelist.files() or the Files pane to check that your file exists there, and setwd() to change your directory if necessary. Preferably, work inside an R project with all project-related files in that same folder. Your working directory will be set automatically when you open the project (this can be done by using File -> New File -> R Notebook and following prompts).

  • typos: R is case sensitive so always check that you’ve spelled everything right. Get used to using the tab autocomplete feature when possible. This can reduce typos and increase your overall programming speed.

  • open quotes, parentheses, brackets:

    • R Markdown Notebooks highlight the current cursor-denoted bracket set in \(\color{grey}{\text{grey}}\). If the bracket is unmatched on either side, a small red \(\color{red}{\text{X}}\) icon will appear on the left-hand side of the code cell, beside the line numbers.
  • data type: Use commands like typeof() and class() to check what type of data you have. Use str() to peak at your data structures if you’re making assumptions about it.

  • unexpected answers: To access the help menu, type help("function"), ?function (using the name of the function that you want to check), or help(package = "package_name").

    • In RStudio: the result will appear in a side-panel on the bottom right of the development environment.
  • function not found: Make sure the package name is properly spelled, installed, AND loaded. Libraries can be loaded to the environment using the function library("package_name"). If you only need one function from a package, or need to specify to what package a function belongs because there are functions with the same name that belong to different packages, you can use a double colon, i.e. package_name::function_name.

  • the R bomb!!: The session aborted can happen for a variety of reasons, like not having enough computational power to perform a task or also because of a system-wide failure.

    • RStudio: restart the Kernel from the menu Session -> Restart R. You will need to rerun your previous cells!
  • cheatsheets: Meet your new best friends: cheatsheets!


1.5.3 Finding answers online

  • 99% of the time, someone has already asked your question

  • Google, Stack overflow, R Bloggers, SEQanswers, Quora, ResearchGate, RSeek, twitter, even reddit

  • Including the program, version, error, package and function helps, so be specific. Sometimes it is useful to include your operating system and version (Windows 10, Ubuntu 18, Mac OS 10, etc.).

  • You may run into assignment questions where the tools I’ve provided in lecture are not enough to reproduce the example output exactly as provided. If you wish to go that extra mile you may need to look for answers elsewhere by consulting references from the class or searching for it yourself. The truth is out there!

1.5.3.1 Asking a question in an online forum

  • Summarize your question in the title (be concise and objective!).
  • Introduce your question, how you ran into the problem, and how you tried to solve it yourself. If you haven’t done the bolded thing, do the bolded thing.
  • Show enough of your code and data for others to try to reproduce the problem/error. This is often referred to as a reproducible example or reprex.
  • Add tags that match your problem.
  • Respond to the feedback and vote for the answered that you picked. People put in their free time to answer and help you.
  • Take a look at StackOverflow’s tips on how to ask questions, as well as CRAN’s

Remember: Everyone looks for help online ALL THE TIME. It is very common. Also, with programming there are multiple ways to come up with an answer, even different packages that let you do the same thing in different ways. You will work on refining these aspects of your code as you go along in this course and in your coding career.

Last but not least, to make life easier: Under the Help pane, there is a list of cheatsheets related to RStudio, the tidyverse and other useful packages.


2.0.0 Foundations of R

There are many tips and tricks to remember about R but here we’ll quickly recall some fundamental knowledge that could be relevant in later lectures.

2.1.0 Assigning variables

If we want to hold on to a number, calculation, or object we need to assign it to a named variable. R has multiple methods for assigning a value to a variable and an order of precedence!

-> and ->> Rightward assignment: we won’t really be using this in our course.

<- and <<- Leftward assignment: assignment used by most ‘authentic’ R programmers but really just a historical keyboard throwback.

= Leftward assignment: commonly used token for assignment in many other programming languages but holds dual meaning!

Notes

  • In R, the assignment of a variable does not produce any standard output.

  • R processes at each new line unless you use a semicolon (;) to separate commands. This applies to assignment as well. One exception being when your function calls are spaced across lines and contained within the ().

  • R calculates the right side of the assignment first the result is then applied to the left.

    • This is a common paradigm in programming that simplifies variable behaviours for counting and tracking results as they build up over time. This also allows us to increment variables or manipulate objects to update them!

2.2.0 Data types are the basic building blocks of R

Data types are used to classify the basic spectrum of values that are used in R. Here’s a table describing some of the common data types we’ll encounter.

Data type Description Example
character Can be single or multiple characters (strings) of letters and symbols. Assigned using double ' or " a#c&E
integer Whole number values, either positive or negative 1
double Any number that is not an integer 7.5
logical Also known as a boolean, representing the state of a conditional (question) TRUE or FALSE
factor Used as a way to make categorical values. Often used as a finite set of values that appear to be string-based in nature except they can be given a user-specified order. Yes/No or Low/Medium/High
NA Represents the value of “Not Available” usually seen when imported data has missing values NA

2.2.1 Data structures hold single or multiple values

The job of data structures is to “host” the different data types. There are five basic types of data structures that we’ll use in R:

Data structure Dimensions Restrictions
vector 1D Holds a single data type
matrix 2D Holds a single data type
array nD Holds a single data type
data frame 2D Holds multiple data types with some restrictions
list 1D (technically) Holds multiple data types AND structures

Sometimes it is helpful to imagine Data Structures as real-world objects to understand how they are shaped and related to each other.


2.2.2 Vectors are like a queue of a single data type

  • Also known as atomic vectors, each element within a vector must be of the same data type: logical, integer, double, character, complex, or raw.

  • For each vector there are two key properties that can be queried with typeof() and length().

  • There is a numerical order to a vector, much like a queue AND you can access each element (piece of data) individually or in groups. Elements are ordered from 1 to length(your_vector) and can be accessed with an indexing operator []

  • Elements of a vector may be named, to facilitate subsetting by character vectors.

  • Elements of a vector may be subset by a logical vector.

# Build a character vector
char.vector <- c("Canada", "United States", "Great Britain")
char.vector
## [1] "Canada"        "United States" "Great Britain"
# subset by a single value
char.vector[2]
## [1] "United States"
# subset by multiple values
char.vector[2:3]
## [1] "United States" "Great Britain"
# subset by removing values (cannot be mixed with positive values)
char.vector[c(-1, -3)]
## [1] "United States"
# subset with repeating multiple values
char.vector[c(1, 2, 3, 3, 2, 1)]
## [1] "Canada"        "United States" "Great Britain" "Great Britain"
## [5] "United States" "Canada"
# Build a named character vector by including variable names
character.vector <- c(a = "Canada", b = "United States", c = "Great Britain")
character.vector
##               a               b               c 
##        "Canada" "United States" "Great Britain"
# subset by element name
character.vector[c("a", "b")]
##               a               b 
##        "Canada" "United States"
# subset by an explicit vector of logicals
character.vector[c(FALSE, TRUE, TRUE)]
##               b               c 
## "United States" "Great Britain"
# Or subset by an implicit vector of logicals
character.vector[character.vector != "Canada"]
##               b               c 
## "United States" "Great Britain"

2.2.2.1 Coercion changes data from one type to another (where applicable)

R will implicitly force (coerce) your vector to be of one data type. In this case, the type that is most inclusive is a character vector. When we explicitly coerce a change from one data type to the next, it is known as casting. You can cast between certain data types and also object types.

  • Type-casting examples: as.logical(), as.integer(), as.double(), as.numeric(), as.character(), and as.factor()

  • Structure casting examples: as.data.frame(), as.list(), and as.matrix()

Importantly, when coercing, the R kernel converts from more specific to general types usually in this order:

logical \(\rightarrow\) integer \(\rightarrow\) numeric \(\rightarrow\) complex \(\rightarrow\) character \(\rightarrow\) list.

# Make a logical vector and display its structure
logical.vector <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
str(logical.vector)
##  logi [1:5] TRUE FALSE TRUE FALSE FALSE
# Make a numeric vector and display its structure
numeric.vector <- c(-1:10)
str(numeric.vector)
##  int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
# Make a mixed vector and display its structure. Take a note of its typing afterwards
mixed.vector <- c(FALSE, TRUE, 1, 2, "three", 4, 5, "six")
str(mixed.vector)
##  chr [1:8] "FALSE" "TRUE" "1" "2" "three" "4" "5" "six"
# Attempt to coerce our vectors
# logical to numeric
as.numeric(logical.vector)
## [1] 1 0 1 0 0
# numeric to logical
as.logical(numeric.vector)
##  [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
# numeric to character
as.character(numeric.vector)
##  [1] "-1" "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
# mixed to a numeric. Note what happens when elements cannot be converted
as.numeric(mixed.vector)
## Warning: NAs introduced by coercion
## [1] NA NA  1  2 NA  4  5 NA

2.2.3 Data Frames hold tabular data

2.2.3.1 Object classes

Now that we have had the opportunity to create a few different vector objects, let’s talk about what an object class is. An object class can be thought of as a structure with attributes that will behave a certain way when passed to a function. Because of this

  • data frames, lists and matrices have their own classes
  • vectors inherit from their data type (e.g. vectors of characters behave like characters)

Some R package developers have created their own object classes. For example, many of the functions in the tidyverse generate tibble objects. They behave in most ways like a data.frame but have a more refined print structure, making it easier to see information such as column types when viewing them quickly. In general, from a trouble-shooting standpoint, it is good to be aware that your data may need to be formatted to fit a certain class of object when using different packages.

After we are done tidying most of our datasets, they will be in tibble objects, but all of the basic data frame functions apply to these as well.


2.2.3.2 Data frames are groups of vectors aligned as columns

While matrices are 2-dimensional structures limited to a single specific type of data within each instance, data frames treat each column of the structure like a vector. The data frame, however, can have multiple data types mixed across each different column. Data frame rules to remember are:

  1. Within a column, all members must be of the same data type (ie character, numeric, Factor, etc.)
  2. All columns must have the same number of rows (hence the matrix shape)

Data frames allows us to generate tables of mixed information much like an Excel spreadsheet.

# Generate a data frame with different variable/column types
mixed.df <- data.frame(country = character.vector,
                       values = numeric.vector[2:4],
                       commonwalth = logical.vector[1:3])

# View the data frame
mixed.df
##         country values commonwalth
## a        Canada      0        TRUE
## b United States      1       FALSE
## c Great Britain      2        TRUE
# Check the structure of the data frame
str(mixed.df)
## 'data.frame':    3 obs. of  3 variables:
##  $ country    : chr  "Canada" "United States" "Great Britain"
##  $ values     : int  0 1 2
##  $ commonwalth: logi  TRUE FALSE TRUE

2.2.3.3 Some useful data frame commands (for now)

  • nrow(data_frame) retrieves the number of rows in a data frame.

  • ncol(data_frame) retrieves the number of columns in a data frame.

  • data_frame$column_name accesses a specific column by it’s name.

  • data_frame[x,y] accesses a specific element located at row x, column y

  • rownames(data_frame) retrieves or assigns row names to your data frame

  • colnames(data_frame) retrieves or assigns columns names to your data frame

There are many more ways to access and manipulate data frames that we’ll explore further down the road. Let’s review some basic data frame code.

# query the dimensions of the data frame
dim(mixed.df)
## [1] 3 3
nrow(mixed.df)
## [1] 3
ncol(mixed.df)
## [1] 3
# retrieve row and column names
rownames(mixed.df)
## [1] "a" "b" "c"
colnames(mixed.df)
## [1] "country"     "values"      "commonwalth"
# print the mixed data frame
mixed.df
##         country values commonwalth
## a        Canada      0        TRUE
## b United States      1       FALSE
## c Great Britain      2        TRUE
# Access portions of the data frame
# a single column
str(mixed.df$country)
##  chr [1:3] "Canada" "United States" "Great Britain"
# a single element
mixed.df[2, 3]          # Use index position
## [1] FALSE
mixed.df[3, "country"]  # Mix index position and column names
## [1] "Great Britain"
# multiple rows
mixed.df[c(1,3),]      # Use vectors to select groups of rows/columns
##         country values commonwalth
## a        Canada      0        TRUE
## c Great Britain      2        TRUE
mixed.df[-2, ]          # Use negative values to EXCLUDE rows/columns
##         country values commonwalth
## a        Canada      0        TRUE
## c Great Britain      2        TRUE

2.2.4 Lists are amorphous bundles strung together with code

Lists can hold mixed data types of different lengths. These are especially useful for bundling data of different types to pass around your scripts, and functions, or when receiving output from functions! Rather than having to call multiple variables by name, you can store them in a single list!

If you forget the contents of your list, use the str() function to check out its structure. str() will tell you the number of items in your list and their data types.

# Make a named list of various items
mixed.list <- list(countries = character.vector, values = numeric.vector, mixed.data = mixed.df)

# Look at some information about our list
str(mixed.list)
## List of 3
##  $ countries : Named chr [1:3] "Canada" "United States" "Great Britain"
##   ..- attr(*, "names")= chr [1:3] "a" "b" "c"
##  $ values    : int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
##  $ mixed.data:'data.frame':  3 obs. of  3 variables:
##   ..$ country    : chr [1:3] "Canada" "United States" "Great Britain"
##   ..$ values     : int [1:3] 0 1 2
##   ..$ commonwalth: logi [1:3] TRUE FALSE TRUE
# What are the names of the elements in mixed.list
names(mixed.list)
## [1] "countries"  "values"     "mixed.data"

Note the $ sign on the left-hand side of the str() output. What follows is the name of our list element proceeded by the : and a description of that element.

# Lists can often be unnamed
unnamed.list <- list(character.vector, numeric.vector, mixed.df)

# Look at some information about our unnamed list
str(unnamed.list)
## List of 3
##  $ : Named chr [1:3] "Canada" "United States" "Great Britain"
##   ..- attr(*, "names")= chr [1:3] "a" "b" "c"
##  $ : int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
##  $ :'data.frame':    3 obs. of  3 variables:
##   ..$ country    : chr [1:3] "Canada" "United States" "Great Britain"
##   ..$ values     : int [1:3] 0 1 2
##   ..$ commonwalth: logi [1:3] TRUE FALSE TRUE
names(unnamed.list)
## NULL

2.2.4.1 Accessing elements from a list is accomplished in multiple ways

Accessing lists is much like opening up a box of boxes of chocolates. You never know what you’re gonna get when you forget the structure!

You can access elements with a mixture of number and naming annotations much like data frames. Also [[x]] is meant to access the xth “element” of the list. Note that unnamed lists cannot be accessed with naming annotations.

  • [x] returns a list object with your element(s) of choice in the list.
  • [[x]] returns a “single” element only but that element could be a vector, data frame, list, etc.
# Subset our list with []
mixed.list[c(1,3,2)]
## $countries
##               a               b               c 
##        "Canada" "United States" "Great Britain" 
## 
## $mixed.data
##         country values commonwalth
## a        Canada      0        TRUE
## b United States      1       FALSE
## c Great Britain      2        TRUE
## 
## $values
##  [1] -1  0  1  2  3  4  5  6  7  8  9 10
str(mixed.list["values"])
## List of 1
##  $ values: int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
# Pull out a single element
mixed.list[[2]]
##  [1] -1  0  1  2  3  4  5  6  7  8  9 10
mixed.list[["countries"]]
##               a               b               c 
##        "Canada" "United States" "Great Britain"
# Give a vector as input to [[]]
mixed.list[[c(1,3)]]
## [1] "Great Britain"
# vs equivalent
mixed.list[[1]][3]
##               c 
## "Great Britain"
# Access a single element from a data frame nested in a list
mixed.list[[c(3,1,1)]]
## [1] "Canada"
# vs equivalient
mixed.list[[3]][1,1]
## [1] "Canada"

Comprehension Question 2.2.4.1: Suppose we had a list named multiDF.list consisting of 3 data frames, as shown in the following code cell. How would you subset the 2nd and 3rd data frames into their own list? How would you access the “values” column from the 3rd data frame? Use the following code cell to help you out.

# Comprehension answer code 2.2.4.1

multiDF.list = list(mixed.df, rbind(mixed.df, mixed.df), rbind(mixed.df, mixed.df, mixed.df))

str(multiDF.list)

# Subset the 2nd and 3rd dataframes as their own list

...

# Output the "values" column of the 3rd dataframe

...

2.3.0 Factors codify your data into categorical variables

Ah, the dreaded factors! A factor is a class of object used to encode a character vector into categories. They are used to store categorical variables and although it is tempting to think of them as character vectors this is a dangerous mistake. Adding or changing data in a data frame with pre-existing factors requires that you match factor levels correctly as well.

Factors make perfect sense if you are a statistician designing a programming language (!) but to everyone else they exist solely to torment us with confusing errors. At its core, a factor is really just an integer vector or character data with an additional attribute, called levels(), which defines the accepted values for that variable.

2.3.0.1 Why use factors?

Why not just use character vectors, you ask?

Believe it or not factors do have some useful properties. For example, factors allow you to specify all possible values a variable may take even if those values are not in your data set. Think of conditional formatting in Excel. We also use them heavily in generating statistical analyses and in grouping data when we want to visualize it.

2.3.0.2 A historical note about R 4.0.x versus r 3.x.x

Since the inception of R, data.frame() calls have been used to create data frames but the default behaviour was to convert strings (and characters) to factors! This is a throwback to the purpose of R, which was to perform statistical analyses on datasets with methods like ANOVA which examine the relationships between variables (ie factors)!

As R has become more popular and its applications and packages have expanded, incoming users have been faced with remembering this obscure behaviour, leading to lost hours of debugging grief as they wonder why they can’t pull information from their dataframes to do a simple analysis on C. elegans strain abundance via molecular inversion probes in datasets of multiplexed populations. #SuspiciouslySpecific

That meant that users usually had to create data frames including the toggle

data.frame(name=character(), value=numeric(), stringsAsFactors = FALSE)

Fret no more! As of R 4.x.x the default behaviour has switched and stringsAsFactors = FALSE is the default! Now if we want our character columns to be factors, we must convert them explicitly, or turn this behaviour on at the outset of creating each data frame!

# Generate a data frame and include factors for all character-based content
str(data.frame(country = character.vector,
               values = numeric.vector[2:4],
               commonwealth = logical.vector[1:3],
               continent = c("North America", "North America", "Europe"),
               stringsAsFactors = TRUE)
    )
## 'data.frame':    3 obs. of  4 variables:
##  $ country     : Factor w/ 3 levels "Canada","Great Britain",..: 1 3 2
##  $ values      : int  0 1 2
##  $ commonwealth: logi  TRUE FALSE TRUE
##  $ continent   : Factor w/ 2 levels "Europe","North America": 2 2 1
# Explicitly define factors for specific variables.
str(data.frame(country = factor(character.vector),
               values = numeric.vector[2:4],
               commonwealth = logical.vector[1:3],
               continent = c("North America", "North America", "Europe"),
               stringsAsFactors = FALSE)
    )
## 'data.frame':    3 obs. of  4 variables:
##  $ country     : Factor w/ 3 levels "Canada","Great Britain",..: 1 3 2
##  $ values      : int  0 1 2
##  $ commonwealth: logi  TRUE FALSE TRUE
##  $ continent   : chr  "North America" "North America" "Europe"

2.3.1 Specify factors and their levels explicitly during or after data.frame creation

From above, you can specify which columns of strings are converted to factors at the time of declaring your column information. Alternatively you can coerce character vectors to factors after generating them.

R’s default behaviour puts factor levels in alphabetical order. This can cause problems if we aren’t aware of it. You can check the order of your factor levels with the levels() command. Furthermore you can specify, during factor creation, your level order.

Always check to make sure your factor levels are what you expect.

With factors, we can deal with our character levels directly, or their numeric equivalents.

# Generate a data frame and include factors
str(data.frame(country = character.vector,
               values = numeric.vector[2:4],
               commonwealth = logical.vector[1:3],
               continent = factor(c("North America", "North America", "Europe"),
                                     levels = c("North America", "Europe"))
              )
   )
## 'data.frame':    3 obs. of  4 variables:
##  $ country     : chr  "Canada" "United States" "Great Britain"
##  $ values      : int  0 1 2
##  $ commonwealth: logi  TRUE FALSE TRUE
##  $ continent   : Factor w/ 2 levels "North America",..: 1 1 2
# Coerce a factor after the fact

# Build a data frame
mixed.df <- data.frame(country = character.vector,
                      values = numeric.vector[2:4],
                      commonwealth = logical.vector[1:3],
                      continent = c("North America", "North America", "Europe"))

# Set our factor after declaring the data frame
mixed.df$continent <- factor(mixed.df$continent, levels=c("North America", "Europe"))


str(mixed.df)
## 'data.frame':    3 obs. of  4 variables:
##  $ country     : chr  "Canada" "United States" "Great Britain"
##  $ values      : int  0 1 2
##  $ commonwealth: logi  TRUE FALSE TRUE
##  $ continent   : Factor w/ 2 levels "North America",..: 1 1 2

2.3.2 More facts about factors

  1. Use levels() to list the levels and their order for your factor

  2. To rename levels of a factor, declare and reassign your factor.

  3. Move a single level to the first position within your factor levels with relevel().

  4. Factor levels can be assigned an order of precedence during their creation with the parameter ordered = TRUE.

  5. Define labels for your factor during their creations with the parameter labels = c(). Note that level order is assigned before labels are added to your data. You are essentially labeling the integer assigned to your factor levels so be careful when using this parameter!

Advanced factors functions with forcats. If you’re looking for more advanced functions that you can use to manipulate, sort or update factors, check out the forcats function. With it, you can refactor based on functions, frequency, or explicitly re-specify the order of one or more factor levels. We’ll see this package in action in more detail during later lectures.


2.4.0 Mathematical operations on data frames and arrays

Yes, you can treat data frames and arrays like large lists where mathematical operations can be applied to individual elements or to entire columns or more!

2.4.1 Mathematical operations are applied differently depending on data type

  • numeric data: operations applied as expected
  • non-numeric (ie characters): error will be thrown
  • factors: warning message and NAs returned
  • logical data (TRUE/FALSE): coercion to numeric before applying operations

Therefore be careful to specify your numeric data for mathematical operations.

mixed.df
##         country values commonwealth     continent
## a        Canada      0         TRUE North America
## b United States      1        FALSE North America
## c Great Britain      2         TRUE        Europe
# Add to each element
mixed.df$values + 3
## [1] 3 4 5
# Add columns to each other
mixed.df$values + mixed.df$values
## [1] 0 2 4
# multiply each element by a constant
mixed.df$values * 4
## [1] 0 4 8
# implicit coercion of logical to integer
mixed.df$commonwealth * 5 
## [1] 5 0 5
# Perform math on a factor
mixed.df$continent * 6
## Warning in Ops.factor(mixed.df$continent, 6): '*' not meaningful for factors
## [1] NA NA NA
# Convert the factor to a numeric first
as.numeric(mixed.df$continent) * 7
## [1]  7  7 14
# Can we perform math on non-numeric variables?
mixed.df$country + 8
## Error in mixed.df$country + 8: non-numeric argument to binary operator

2.5.0 Using the apply() family of functions to perform actions across data structures

The above are illustrative examples to see how our different data structures behave. In reality, you will want to do calculations across rows and columns, and not on your entire matrix or data frame.

2.5.1 The apply() function will recognize basic functions and use them on vectorized data

For example, we might have a count table where rows are genes, columns are samples, and we want to know the sum of all the counts for a gene. To do this, we can use the apply() function. apply() Takes an array, matrix (or something that can be coerced as such, like a numeric data frame), and applies a function over rows or columns. The apply() function takes the following parameters:

  • X: an array. matrix or something that can be coerced to these objects
  • MARGIN: defines how to apply the function; 1 = rows, 2 = columns.
  • FUN: the function to be applied. Supplied as a function name without the () suffix
  • ...: this notation means we can pass additional parameters to our function defined by FUN.

and returns a vector, array or list depending on the nature of X.

Let’s practice by invoking the sum function.

# Make a sample data frame of numeric values only
numeric.df = data.frame(geneA = numeric.vector, geneB = numeric.vector*2, geneC = numeric.vector*3)

# We now have a 12x3 dataframe
numeric.df
##    geneA geneB geneC
## 1     -1    -2    -3
## 2      0     0     0
## 3      1     2     3
## 4      2     4     6
## 5      3     6     9
## 6      4     8    12
## 7      5    10    15
## 8      6    12    18
## 9      7    14    21
## 10     8    16    24
## 11     9    18    27
## 12    10    20    30
# Apply sum() to each rows
apply(numeric.df, MARGIN = 1, sum)
##  [1] -6  0  6 12 18 24 30 36 42 48 54 60
# Apply sum() to each column
apply(numeric.df, 2, sum)
## geneA geneB geneC 
##    54   108   162

2.5.2 The other members of the apply() family

There are 3 additional members of the apply() family that perform similar functions with varying outputs

  1. lapply(data, FUN, ...) is usable on dataframes, lists, and vectors. It returns a list as output.
  • It will coerce non-list objects to a list
  • Additional arguments to FUN will be applied from the ...
  1. sapply(data, FUN, ...) works similarly to lapply() except it tries to simplify the output to the most elementary data structure possible. i.e. it will return the simplest form of the data that makes sense as a representation.

  2. mapply(FUN, data, ...) is short for “multivariate” apply and it applies a function to multiple lists or multiple vector arguments.

# Use lapply on the columns of numeric.df
lapply(numeric.df, sum)
## $geneA
## [1] 54
## 
## $geneB
## [1] 108
## 
## $geneC
## [1] 162
str(lapply(numeric.df, sum))
## List of 3
##  $ geneA: int 54
##  $ geneB: num 108
##  $ geneC: num 162
# Use sapply on the columns of numeric.df
sapply(numeric.df, sum)
## geneA geneB geneC 
##    54   108   162
# We are returned a named vector
str(sapply(numeric.df, sum))
##  Named num [1:3] 54 108 162
##  - attr(*, "names")= chr [1:3] "geneA" "geneB" "geneC"
# Using lapply and sapply and sum on an actual list
sum.list <- list(numeric.vector, numeric.df)
str(sum.list)
## List of 2
##  $ : int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
##  $ :'data.frame':    12 obs. of  3 variables:
##   ..$ geneA: int [1:12] -1 0 1 2 3 4 5 6 7 8 ...
##   ..$ geneB: num [1:12] -2 0 2 4 6 8 10 12 14 16 ...
##   ..$ geneC: num [1:12] -3 0 3 6 9 12 15 18 21 24 ...
# lapply on the list returns a list
lapply(sum.list, sum)
## [[1]]
## [1] 54
## 
## [[2]]
## [1] 324
# sapply on the list returns a vector
sapply(sum.list, sum)
## [1]  54 324
# Use lapply to select portions from a list
sum.list <- list(numeric.df, numeric.df)

# Extract the first row from each member of the list
print("sum.list first rows:")
## [1] "sum.list first rows:"
lapply(sum.list, "[", 1, )
## [[1]]
##   geneA geneB geneC
## 1    -1    -2    -3
## 
## [[2]]
##   geneA geneB geneC
## 1    -1    -2    -3
# Take a close look at what sapply returns in this case
sapply(sum.list, "[", 1,)
##       [,1] [,2]
## geneA -1   -1  
## geneB -2   -2  
## geneC -3   -3
# Extract the 2nd column from each member of the list
print("sum.list second columns:")
## [1] "sum.list second columns:"
lapply(sum.list, "[", , 2)
## [[1]]
##  [1] -2  0  2  4  6  8 10 12 14 16 18 20
## 
## [[2]]
##  [1] -2  0  2  4  6  8 10 12 14 16 18 20
# Take a close look at what sapply returns in this case
sapply(sum.list, "[", , 2)
##       [,1] [,2]
##  [1,]   -2   -2
##  [2,]    0    0
##  [3,]    2    2
##  [4,]    4    4
##  [5,]    6    6
##  [6,]    8    8
##  [7,]   10   10
##  [8,]   12   12
##  [9,]   14   14
## [10,]   16   16
## [11,]   18   18
## [12,]   20   20

Notice how in using sapply() to extract from a list of data frames, a single matrix was returned - a single output in the simplest form that maintains structure.

Now let’s give mapply() a try.

# Use mapply in an example on numeric.vector
mapply(sum, numeric.vector, numeric.vector)
##  [1] -2  0  2  4  6  8 10 12 14 16 18 20
numeric.vector + numeric.vector
##  [1] -2  0  2  4  6  8 10 12 14 16 18 20
# Use mapply in an example on numeric.df
mapply(sum, numeric.df, numeric.df)
## geneA geneB geneC 
##   108   216   324
# Use mapply on the rep function to see its output
mapply(rep, c("repeat", "this", "phrase"), 4)
##      repeat   this   phrase  
## [1,] "repeat" "this" "phrase"
## [2,] "repeat" "this" "phrase"
## [3,] "repeat" "this" "phrase"
## [4,] "repeat" "this" "phrase"

So from our observations, we see that mapply() it looks like

  1. For our vectors, it has summed them element-wise. This is equivalent to numeric.vector + numeric.vector
  2. For our dataframes, it has calculated the column-wise sums and matched them with columns of the same position in the other data frames to calculate a total sum.

In each case, it is applying sum to the first elements (or columns) of each argument, then the second elements, and so on. So new sets are formed for each element-wise position before applying the FUN argument!

2.6.0 Special data: NA and NaN values

Missing values in R are handled as NA (Not Available). Impossible values (like the results of dividing by zero) are represented by NaN (Not a Number). These types of values can be considered null values. These two types of values, specially NAs, have special ways to be dealt with, otherwise it may lead to errors in some functions.

For our purposes, we are not interested in keeping NA data within our datasets so we will usually detect and remove them or replace them within our data after it is imported.

2.6.1 Helpful functions and information for dealing with NA data

  1. is.na() returns a logical vector reporting which values from your query are NA.
  2. complete.cases() returns row-matched logical vector with a value of TRUE for rows without any NA values.
  3. Some functions can ignore NA values with the na.rm = TRUE parameter: ie mean(), sum() etc.
  4. Additional functions in the tidyr package can also be used to work with NA values.
# Add some NAs to our data frame
mixed.df <- data.frame(country = character.vector,
                      values = c(3, NA, 9),
                      commonwealth = logical.vector[1:3],
                      continent = c("North America", "North America", "Europe"),
                      measure = c("metric", NA, "metric")
                      )
# Look at our updated data frame
mixed.df
##         country values commonwealth     continent measure
## a        Canada      3         TRUE North America  metric
## b United States     NA        FALSE North America    <NA>
## c Great Britain      9         TRUE        Europe  metric
# Which entries are NA?
is.na(mixed.df)
##   country values commonwealth continent measure
## a   FALSE  FALSE        FALSE     FALSE   FALSE
## b   FALSE   TRUE        FALSE     FALSE    TRUE
## c   FALSE  FALSE        FALSE     FALSE   FALSE
# Which rows are incomplete?
complete.cases(mixed.df)
## [1]  TRUE FALSE  TRUE
# Use some math functions 
sum(mixed.df$values, na.rm=TRUE)
## [1] 12

3.0.0 Welcome to the tidyverse

Each dataset has it’s own problems. Image from: https://cfss.uchicago.edu/notes/tidy-data/

Let’s begin with some definitions:

In data science, long format is preferred over wide format because it allows for an easier and more efficient subsetting and manipulation of the data. To read more about wide and long formats, visit here.

Why tidy data?

Data cleaning/wrangling (or dealing with ‘messy’ data) accounts for a huge chunk of a data scientist’s time. Ultimately, we want to get our data into a ‘tidy’ format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure and tools that work with that standardized data structure can help this process along.

In Tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Every cell is a single value.

This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data is helpful to unraveling its structure and getting it into a usable format.

3.0.1 The 5 most common problems with messy datasets are:

  • common headers are values, not variable names
  • multiple variables are stored in one column
  • variables are stored in both rows and columns
  • a single variable stored in multiple tables
  • multiple types of observational units are stored in the same table

Observational units: Of the three rules, the idea of observational units might be the hardest to grasp. As an example, you may be tracking a puppy population across 4 variables: age, height, weight, fur colour. Each observation unit is a puppy. However, you might be tracking the same puppies across multiple measurements - so a time factor applies. In that case, the observation unit now becomes puppy-time. Now each puppy-time measurement belongs in a different table (at least by tidy data standards). This, however, is a simple example and things can get more complex when taking into consideration what defines an observational unit. Check out this blog post by Claus O. Wilke for a little more explanation.

Let’s begin this journey with data import.


3.1.0 Opening and saving files with the readr package - “All roads lead to Rome..”

… but not all roads are easy to travel.

Depending on format, data files can be opened in a number of ways. The simplest methods we will use involve the readr package as part of the tidyverse. These functions have already been developed to simplify the import process for users. The functions we will use most often are:

  • Read in a delimited file: read_delim(), read_csv(), read_tsv(), read_csv2() [European datasets]

  • Read in from a file, line by line: read_lines()

Let’s read in our first dataset so that we can convert from wide to long format.

# Use read_csv to look at our compiled Sunshine list
sunshineWide.df <- read_csv("./data/sunshineList_subset_numID_wide.tsv")
## Rows: 516966 Columns: 8
## -- Column specification ------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr (7): 1996, 2001, 2006, 2011, 2016, 2021, 2023
## dbl (1): numericID
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the structure and characteristics of sunshineWide.df
str(sunshineWide.df, give.attr = FALSE)
## spc_tbl_ [516,966 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ numericID: num [1:516966] 19206219 10627148 17402443 18586778 17626134 ...
##  $ 1996     : chr [1:516966] "Other Public Sector Employers_Addiction Research Foundation_President & Ceo_$194,890.40_$711.24" "Other Public Sector Employers_Addiction Research Foundation_Dir., Soc. Eval. Research & Act. Dir., Clin. Resear"| __truncated__ "Other Public Sector Employers_Addiction Research Foundation_V.p., Research & Coordinator, Intern. Programs_$149,434.48_$512.58" "Ontario Public Service_Agriculture,Food And Rural Affairs_Deputy Minister_$109,382.92_$4,921.68" ...
##  $ 2001     : chr [1:516966] NA NA NA NA ...
##  $ 2006     : chr [1:516966] NA NA NA NA ...
##  $ 2011     : chr [1:516966] NA NA NA NA ...
##  $ 2016     : chr [1:516966] NA NA NA NA ...
##  $ 2021     : chr [1:516966] NA NA NA NA ...
##  $ 2023     : chr [1:516966] NA NA NA NA ...
head(sunshineWide.df)
## # A tibble: 6 x 8
##   numericID `1996`                     `2001` `2006` `2011` `2016` `2021` `2023`
##       <dbl> <chr>                      <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
## 1  19206219 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA    
## 2  10627148 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA    
## 3  17402443 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA    
## 4  18586778 Ontario Public Service_Ag~ NA     NA     NA     NA     NA     NA    
## 5  17626134 Hospitals_Ajax And Picker~ Hospi~ NA     NA     NA     NA     NA    
## 6  13808138 Colleges_Algonquin Colleg~ Colle~ Colle~ Colle~ NA     NA     NA
tail(sunshineWide.df)
## # A tibble: 6 x 8
##   numericID `1996` `2001` `2006` `2011` `2016` `2021` `2023`                    
##       <dbl> <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>                     
## 1  13469140 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
## 2  19670454 NA     NA     NA     NA     NA     NA     Ontario Power Generation_~
## 3  13131015 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
## 4  18645579 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
## 5  16717888 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
## 6  15417227 NA     NA     NA     NA     NA     NA     Other Public Sector Emplo~
any(is.na(sunshineWide.df))
## [1] TRUE

3.1.1 Our Sunshine list data covers 28 years of income

From looking at our data, we see there are 29 columns across 516966 observations. This represents 28 years of income data on 516,966 unique individual IDs. From the outset, we can see there are some issues with the data set that we’ll want to resolve and we’ll work through some tidyverse functions in order to do that. First let’s quickly review some of the potential problems with our dataset.

  1. Under each year (column) we see a very interesting collection of information. The data actually represents many variables and information that appear to be separated by the underscore (_). These data actually represent 5 variables: Sector, Employer, Job Title, Salary Paid, and Taxable Benefits.

  2. There are many NA values present. This is to be expected given that not every individual can possibly have income data across every year covered. Some individuals will enter later or retire during the 28-years of data.

  3. Our Sector names and other column values will be quite inconsistent and we’ll want to address these by reformating them properly.

In the end, we want to convert our data to look something like this:

numericID <fct> salary <dbl> taxableBenefits <dbl> calendarYear <int> sector <fct> employer <chr> title<chr>
19206219 $194,890.40 $711.24 1996 Other Public Sector Employers Addiction Research Foundation President & CEO
10627148 $115,603.62 $403.41 1996 Other Public Sector Employers Addiction Research Foundation Dir., Soc. Evl. Research & Act. Dir., Clin. Research
17402443 $149,434.48 $512.58 1996 Other Public Sector Employers Addiction Research Foundation V.p., Research & Coordinator, Intern. Programs

Before we tackle these issues, let’s go ahead and review some of the tools at our disposal.


3.2.0 The tidyverse package and it’s contents make manipulating data easier

While the tidyverse is composed of multiple packages, we will be focused on working with a subset of these: dplyr, tidyr, and stringr.

3.2.0.1 Redirect your output with %>% whenever you can!

To save on making extra variables in memory and to help make our code more concise, we should use of the %>% symbol. This is a redirection or pipe symbol similar to the | in Unix operating systems and is used for redirecting output from one function to the input of another. By thoughtfully combining this with other commands, we can alter or query our datasets with ease.

We’ll also introduce the %<>% in this class. This is a little more advanced but it allows us to assign the final product of our chain of commands to the very first object.

Whenever we are redirecting, we are implicitly passing our output to the first parameter of the next function. We may not always want to use the entirety of the output or we may want to also reuse that redirected output as part of another parameter. To do so we can use . to explicitly denote the redirected output.

Native Piping in R: Note that as of R vs 4.1.0 a native pipe symbol |> was added to the language which has the same function as the %>% symbol we are using. However, RStudio holds a shortcut set of keys ctrl + shift + m which can make it more convenient to insert while coding.

3.2.0.2 dplyr has functions for accessing and altering your data

We will use the “verbs” of the dplyr function often to massage the look of our data by changing column names or subsetting it. The most common verbs you will see in this course are.

Function(s) Description
arrange() Arranging rows by column values
count(), tally() Counting observations by group
distinct() Subsetting rows by distinct or unique values
filter() Subsetting rows by column values
mutate(), transmute() Create, modify, or delete columns
select() Subset columns using their names and types
summarize() or summarise() Summarize by groups to fewer rows
group_by() vs. ungroup() group by one or more variables
rowwise() group data as single rows for calculations across each
rename(), and relocate() Rename or move columns

3.2.0.3 tidyr has additional functions for reshaping our data

The tidyr package will be most useful when we are trying to reshape our data from the wide to the long format or vice versa. This is much more useful for when we want to drastically alter portions or all of our data.

Function(s) Description
pivot_longer() Pivot data from wide to long
pivot_wider() Pivot data from long to wide
extract() Extract a character column into multiple groups
separate() Separate a character column into multiple groups
unite() Unite multiple columns into one by pasting strings
drop_na() Drop rows containing missing values
replace_na() Replace NAs with specific values

3.2.0.4 stringr provides functionality for searching data based on regular expressions

The stringr package will come in most useful when we are trying to fix string issues with our data. Many time our headers or data will contain spaces or poor formatting. Many times we will prefer to have our headers in lower case format, with any spaces replaced by an _. We’ll also use verbs from this package to make any variables or data more concise.

Category Function(s) Description
String analysis str_count() Count the number of matches in a string
String retrieval str_detect() Detect the presence (or absence) of a pattern in string
str_extract() and str_extract_all() Extract matching patterns from a string
str_match() and str_match_all() Extract matched groups from a string
str_subset() and str_which() Keep or find strings matching a pattern
String alteration str_remove() and str_remove_all() Remove matched patterns from a string
str_split(), str_split_fixed(), and str_split_n() Split a string into pieces
str_c() Concatenate multiple strings into a single string with optional separator
str_flatten() Flatten a string
str_sub() Extract and replace substrings from a character vector
str_to_upper() and str_to_lower() Convert case of a string

Time to tackle our dataset!

3.2.1 Reformat our wide table with pivot_longer()

As you may recall, our Sunshine data is formatted such that each column represents a host of information salary information for a given year from 1996-2023. However, for us to begin working with this data, we want to move it towards a long-format which requires that we have the data from each calendar year put into a single column, denoting each date. This way, each row will begin to represent a single observation for each unique ID, for a single year entry.

Today we will use the pivot_longer() function to convert our wide-format data over to long-format. For our purposes, we will rely on four parameters:

  1. data: the data frame (and columns) that we wish to transform.
  2. cols: the columns that we wish to gather/collapse into a long format.
  3. names_to: the variable name of the new column to hold the collapsed information from our current columns.
  4. values_to: The variable name of the values for each observation that we are collapsing down.

We’ll be using a series of %>% so for now we won’t save our work to a new object.

# A reminder of what our data looks like
sunshineWide.df %>% head()
## # A tibble: 6 x 8
##   numericID `1996`                     `2001` `2006` `2011` `2016` `2021` `2023`
##       <dbl> <chr>                      <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
## 1  19206219 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA    
## 2  10627148 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA    
## 3  17402443 Other Public Sector Emplo~ NA     NA     NA     NA     NA     NA    
## 4  18586778 Ontario Public Service_Ag~ NA     NA     NA     NA     NA     NA    
## 5  17626134 Hospitals_Ajax And Picker~ Hospi~ NA     NA     NA     NA     NA    
## 6  13808138 Colleges_Algonquin Colleg~ Colle~ Colle~ Colle~ NA     NA     NA
# Start with our wide-format phu data
sunshineWide.df %>% 

# Pivot the data into a long-format set
pivot_longer(cols = c(2:8), names_to = "calendarYear", values_to = "combinedData") %>% 

# Just take a quick look at the output.
str()
## tibble [3,618,762 x 3] (S3: tbl_df/tbl/data.frame)
##  $ numericID   : num [1:3618762] 19206219 19206219 19206219 19206219 19206219 ...
##  $ calendarYear: chr [1:3618762] "1996" "2001" "2006" "2011" ...
##  $ combinedData: chr [1:3618762] "Other Public Sector Employers_Addiction Research Foundation_President & Ceo_$194,890.40_$711.24" NA NA NA ...

3.2.2 Remove NA observations from our data with filter()

Our conversion to long format creates 14,475,048 observations relating our numericID data to calendarYear but many of the entries in those observations simply have no values in the combinedData variables since none exist. Therefore, we want to remove those observations that are simply invalid.

One way we could have removed those non-existent values was in our pivot_longer() call with the values_drop_na = TRUE parameter. However, for the sake of practice we’ll work with the filter() verb since we’ll be using that a lot more throughout our workflows.

The filter() function take some of the following parameters:

  1. .data : the data set in question. When working with the %>% operator, this is implicitly assigned by the output from the last function.
  2. ... : the series of predicates/conditions that you want to filter with. This can be a simply conditional statement, or multiple ones in a comma-separated format.

In our case, we are going to use the is.na() function to help us determine which rows to keep. We’ll save the result of this initial transformation to a new variable sunshineLong.df.

# Start with our wide-format phu data

# We'll save our pivoted data into a new variable to save some time in the future
sunshineLong.df <-
  
  sunshineWide.df %>% 

  # Pivot the data into a long-format set
  pivot_longer(cols = c(2:8), names_to = "calendarYear", values_to = "combinedData") %>% 

  # filter out our NA rows
  filter(!is.na(combinedData))

# Check out our resulting data
str(sunshineLong.df)
## tibble [785,777 x 3] (S3: tbl_df/tbl/data.frame)
##  $ numericID   : num [1:785777] 19206219 10627148 17402443 18586778 17626134 ...
##  $ calendarYear: chr [1:785777] "1996" "1996" "1996" "1996" ...
##  $ combinedData: chr [1:785777] "Other Public Sector Employers_Addiction Research Foundation_President & Ceo_$194,890.40_$711.24" "Other Public Sector Employers_Addiction Research Foundation_Dir., Soc. Eval. Research & Act. Dir., Clin. Resear"| __truncated__ "Other Public Sector Employers_Addiction Research Foundation_V.p., Research & Coordinator, Intern. Programs_$149,434.48_$512.58" "Ontario Public Service_Agriculture,Food And Rural Affairs_Deputy Minister_$109,382.92_$4,921.68" ...

By filtering our NA values, we reduced our number of observations by about 2.9M entries!

3.2.3 Split our data column using the separate() function

Looking at our current wrangling output, we see that we still have to deal with the combinedData column which has all of that juicy underscore-separated information. We’ll use the separate() function to help us break apart our column into 5 new columns.

For the separate() function we will use the following parameters:

  1. .data: the dataframe we will use for our function.

  2. col: the specific column we wish to break apart into new columns.

  3. into: a vector of names that we’ll use for naming the new columns we are creating.

  4. sep: the character(s) we want to use the separator for our data. This can be a _ or : or whatever we come across.

As this can be a computationally intensive step, we’ll also be saving this to sunshineLong.df using the compound assignment operator %<>%.

# Start with our long-format phu data
sunshineLong.df %<>% 

  # separate our combinedData column
  separate(., col = combinedData, 
           into = c("sector", "employer", "jobTitle", "salaryPaid", "taxableBenefits"),
           sep = "_")
  

# take a quick look at the structure
str(sunshineLong.df)
## tibble [785,777 x 7] (S3: tbl_df/tbl/data.frame)
##  $ numericID      : num [1:785777] 19206219 10627148 17402443 18586778 17626134 ...
##  $ calendarYear   : chr [1:785777] "1996" "1996" "1996" "1996" ...
##  $ sector         : chr [1:785777] "Other Public Sector Employers" "Other Public Sector Employers" "Other Public Sector Employers" "Ontario Public Service" ...
##  $ employer       : chr [1:785777] "Addiction Research Foundation" "Addiction Research Foundation" "Addiction Research Foundation" "Agriculture,Food And Rural Affairs" ...
##  $ jobTitle       : chr [1:785777] "President & Ceo" "Dir., Soc. Eval. Research & Act. Dir., Clin. Research" "V.p., Research & Coordinator, Intern. Programs" "Deputy Minister" ...
##  $ salaryPaid     : chr [1:785777] "$194,890.40" "$115,603.62" "$149,434.48" "$109,382.92" ...
##  $ taxableBenefits: chr [1:785777] "$711.24" "$403.41" "$512.58" "$4,921.68" ...

3.2.4 Use the stringr package to remove unwanted characters

Looking above at the structure of our data, we can see that our salaryPaid and taxableBenefits columns are of the chr datatype. You can probably sense that intuitively these should be numeric in nature. We cannot, however, just convert these directly but must remove some characters like the “$” and “,” characters that were placed in here.

We can use some simple verbs from the stringr package to help us out. In the process we’ll use mutate() to alter these same variables so we can save their updated state.

# Start with our long-format phu data
sunshineLong.df %<>% 

  # Mutate and update the values in salaryPaid and taxableBenefits
  mutate(salaryPaid = str_remove_all(string = salaryPaid, pattern = r"(\$|,)"),
         taxableBenefits = str_remove_all(string = taxableBenefits, pattern = r"(\$|,)")) %>% 
  
  # Convert the updated variables to the correct data type
  mutate(salaryPaid = as.double(salaryPaid),
         taxableBenefits = as.double(taxableBenefits),
         calendarYear = as.integer(calendarYear),
         numericID = as.factor(numericID)) 
  
# take quick look at the structure
str(sunshineLong.df)
## tibble [785,777 x 7] (S3: tbl_df/tbl/data.frame)
##  $ numericID      : Factor w/ 435634 levels "10000005","10000011",..: 400947 27306 322496 373847 332040 332040 166012 166012 166012 166012 ...
##  $ calendarYear   : int [1:785777] 1996 1996 1996 1996 1996 2001 1996 2001 2006 2011 ...
##  $ sector         : chr [1:785777] "Other Public Sector Employers" "Other Public Sector Employers" "Other Public Sector Employers" "Ontario Public Service" ...
##  $ employer       : chr [1:785777] "Addiction Research Foundation" "Addiction Research Foundation" "Addiction Research Foundation" "Agriculture,Food And Rural Affairs" ...
##  $ jobTitle       : chr [1:785777] "President & Ceo" "Dir., Soc. Eval. Research & Act. Dir., Clin. Research" "V.p., Research & Coordinator, Intern. Programs" "Deputy Minister" ...
##  $ salaryPaid     : num [1:785777] 194890 115604 149434 109383 110309 ...
##  $ taxableBenefits: num [1:785777] 711 403 513 4922 3157 ...

3.2.5 Use the pull() verb to retrieve a column as a vector

Sometimes when you want to quickly assess your data, it can be very helpful to isolate a column to look at its contents. To keep up with the paradigm of piping our calls and keeping our code more readable, I suggest the pull() verb to help retrieve single variables form your data frame. These will be returned as a vector that you can then pass along to functions like unique().

Here we will retrieve the sector variable to see just how many different sectors there are in our data.

# Pull the sector variable and look at it's values
sunshineLong.df %>% 
  # Grab the sector data
  pull(sector) %>% 
  # Determine the unique values
  unique() %>% 
  # Sort them for comparison
  sort()
##  [1] "Colleges"                                                
##  [2] "Crown Agencies"                                          
##  [3] "Government Of Ontario - Judiciary"                       
##  [4] "Government Of Ontario - Legislative Assembly"            
##  [5] "Government Of Ontario - Legislative Assembly And Offices"
##  [6] "Government Of Ontario - Ministries"                      
##  [7] "Hospitals"                                               
##  [8] "Hospitals And Boards Of Public Health"                   
##  [9] "Hydro One And Ontario Power Generation"                  
## [10] "Municipalities"                                          
## [11] "Municipalities And Services"                             
## [12] "Ontario Power Generation"                                
## [13] "Ontario Public Service"                                  
## [14] "Other Public Sector Employers"                           
## [15] "School Boards"                                           
## [16] "Seconded (Advanced Education And Skills Development)*"   
## [17] "Seconded (Attorney General)*"                            
## [18] "Seconded (Cabinet Office)*"                              
## [19] "Seconded (Children And Youth Services)*"                 
## [20] "Seconded (Children, Community And Social Services)*"     
## [21] "Seconded (Citizenship And Multiculturalism)*"            
## [22] "Seconded (Community Safety And Correctional Services)*"  
## [23] "Seconded (Economic Development And Innovation)*"         
## [24] "Seconded (Education)*"                                   
## [25] "Seconded (Energy)*"                                      
## [26] "Seconded (Health And Long-Term Care)*"                   
## [27] "Seconded (Health Promotion And Sport)*"                  
## [28] "Seconded (Health)*"                                      
## [29] "Seconded (Labour)*"                                      
## [30] "Seconded (Solicitor General)*"                           
## [31] "Universities"                                            
## [32] "Universities - Universités"

3.2.6 Update additional character values with stringr verbs

Looking at the output from the sector variable there are a total of 32 unique values. We can see there are some interesting things worth cleaning up in these categories though:

  1. Is there a difference between “Universities” and “Universities - Universités”?
  2. We have sector information buried behind prefixes like “Government Of Ontario” and “Seconded”.
  3. The use of parentheses in the “Seconded” sectors will be replaced with the term “Ministry”.
  4. Some of the sectors have had different names from year to year so we’ll try to consolidate these where possible.
sector entry Updated/combined sector
“Universities - Universités” Universities
“Municipalities” Municipalities And Services
“Legislative Assembly” Legislative Assembly And Offices

Note: Seconded is used to define individuals that have been laterally moved to other departments or sectors for a temporary period (kind of like being on loan). They are technically paid by their original employer BUT for the purposes of this dataset we’ll treat them like they are from the seconded department. Otherwise we would have to do a lot more data wrangling.

sunshineLong.df %<>% 
  # Remove the values of Seconded from the various sector names and only keep what is within the parentheses
  mutate(sector = str_replace_all(sector, pattern = r"(Seconded \((.*)\))", replacement = "Ministry: \\1")) %>% 
  
  # Remove any extra spaces or asterisks from these entries
  mutate(sector = str_remove_all(sector, pattern = r"(\*|\s$)")) %>%
  
  # Remove the "Government of Ontario" prefix
  mutate(sector = str_remove_all(sector, pattern = r"(Government Of Ontario\s[-]\s)")) %>% 
  
  # Combine a few different sector categories into the same ones
  # This is mostly just the result of inconsistencies from year to year
  mutate(sector = str_replace_all(sector,
                                  pattern = "Universities - Universités",
                                  replacement = "Universities")) %>% 
  
  mutate(sector = str_replace_all(sector,
                                  pattern = "Municipalities$",
                                  replacement = "Municipalities And Services")) %>%
  
  mutate(sector = str_replace_all(sector,
                                  pattern = "Legislative Assembly$",
                                  replacement = "Legislative Assembly And Offices")) %>%

  # Now that we've completed our changes, convert the variable to a factor
  mutate(sector = as.factor(sector)) 

# Take a peek at the results  
str(sunshineLong.df)
## tibble [785,777 x 7] (S3: tbl_df/tbl/data.frame)
##  $ numericID      : Factor w/ 435634 levels "10000005","10000011",..: 400947 27306 322496 373847 332040 332040 166012 166012 166012 166012 ...
##  $ calendarYear   : int [1:785777] 1996 1996 1996 1996 1996 2001 1996 2001 2006 2011 ...
##  $ sector         : Factor w/ 29 levels "Colleges","Crown Agencies",..: 27 27 27 26 3 3 1 1 1 1 ...
##  $ employer       : chr [1:785777] "Addiction Research Foundation" "Addiction Research Foundation" "Addiction Research Foundation" "Agriculture,Food And Rural Affairs" ...
##  $ jobTitle       : chr [1:785777] "President & Ceo" "Dir., Soc. Eval. Research & Act. Dir., Clin. Research" "V.p., Research & Coordinator, Intern. Programs" "Deputy Minister" ...
##  $ salaryPaid     : num [1:785777] 194890 115604 149434 109383 110309 ...
##  $ taxableBenefits: num [1:785777] 711 403 513 4922 3157 ...

3.2.7 rename() variables for clarity or simplicity

Looking at the output, we’ve whittled down our sectors from 32 to 29 so it should be easier when we start visualizing that data later in the future. Just a couple more steps before we are done with our wrangling.

Next up we’ll rename our variables just a little by simplifying them using the rename() verb from dplyr. There are a number of ways you could accomplish this without using dplyr but the simplicity of it is nice. The parameters here follow the format of newColumnName = oldColumnName for each column name we want to alter.

# Pass long our sunshine list to rename the columns
sunshineLong.df %>% 
  rename(year = calendarYear,
         title = jobTitle,
         salary = salaryPaid) %>% 
  
  # Take a peek at the results
  str()
## tibble [785,777 x 7] (S3: tbl_df/tbl/data.frame)
##  $ numericID      : Factor w/ 435634 levels "10000005","10000011",..: 400947 27306 322496 373847 332040 332040 166012 166012 166012 166012 ...
##  $ year           : int [1:785777] 1996 1996 1996 1996 1996 2001 1996 2001 2006 2011 ...
##  $ sector         : Factor w/ 29 levels "Colleges","Crown Agencies",..: 27 27 27 26 3 3 1 1 1 1 ...
##  $ employer       : chr [1:785777] "Addiction Research Foundation" "Addiction Research Foundation" "Addiction Research Foundation" "Agriculture,Food And Rural Affairs" ...
##  $ title          : chr [1:785777] "President & Ceo" "Dir., Soc. Eval. Research & Act. Dir., Clin. Research" "V.p., Research & Coordinator, Intern. Programs" "Deputy Minister" ...
##  $ salary         : num [1:785777] 194890 115604 149434 109383 110309 ...
##  $ taxableBenefits: num [1:785777] 711 403 513 4922 3157 ...

3.2.8 Reorder your columns with relocate()

The last cleanup we want to accomplish is to move salary and taxableBenefits closer to the start of our data frame. The reason for this is that these two columns represent actual data points we are interested in while the others are more metadata that we can use later on for sorting.

The relocate() verb from dplyr accomplishes this with ease since we are not dropping or removing columns. It uses some extra syntax to help accomplish its functions:

  1. .data: the data frame or tibble we want to alter
  2. ...: the columns we wish to move
  3. .before or .after: determines the destination of the columns. Supplying neither will move columns to the left-hand side.

In fact, relocate() can be used to rename a column as well but it will also be moved by default so consider the ramifications of such an action!

Note: We could accomplish a similar result using the select() command as well. It’s really up to what you’re comfortable with but it is much simpler to use relocate() when you are working with a large number of columns and you want to move one to a specific location.

We’ll save this final bit of wrangling into the variable sunshineFinal.df.

# Save our result into a new variable
sunshineFinal.df <-
  # Pass along our sunshine list to rename the columns
  sunshineLong.df %>% 
  rename(year = calendarYear,
         title = jobTitle,
         salary = salaryPaid) %>% 
  
  # relocate the measurement data to the left
  relocate(salary, taxableBenefits, .after = numericID)
  
# Take a peek at the results
head(sunshineFinal.df)
## # A tibble: 6 x 7
##   numericID  salary taxableBenefits  year sector                  employer title
##   <fct>       <dbl>           <dbl> <int> <fct>                   <chr>    <chr>
## 1 19206219  194890.            711.  1996 Other Public Sector Em~ Addicti~ Pres~
## 2 10627148  115604.            403.  1996 Other Public Sector Em~ Addicti~ Dir.~
## 3 17402443  149434.            513.  1996 Other Public Sector Em~ Addicti~ V.p.~
## 4 18586778  109383.           4922.  1996 Ontario Public Service  Agricul~ Depu~
## 5 17626134  110309            3157   1996 Hospitals               Ajax An~ Pres~
## 6 17626134  195592.           6517.  2001 Hospitals               Rouge V~ Exec~
# Make a quick copy of our final table
sunshineLong_copy.df <- sunshineLong.df 

Comprehension Question 3.2.9: In the above example we used the relocate() function to move the “salary” and “taxableBenefits” column to near the start of our data frame. What other methods could we use to accomplish the same feat? Use the below code cell to help yourself out.

# comprehension answer code 3.2.9
# Relocate our target column using the select() command

# Use this copy of the sunshine list 
sunshineLong_copy.df %>% 

  # Rename some of the variables
  rename(year = calendarYear,
         title = jobTitle,
         salary = salaryPaid) %>% 
  
  # relocate "salary" and "taxableBenefits" to the right of "numericID"
  ... %>% 

head()
## Error in ...(.): could not find function "..."

3.3.0 Save your data to a file - “Country roads… save to home!”

At this point we have completed the data wrangling we want to accomplish on this dataset. We’ve converted it to a long-format and renamed the Sectors entries while removing any NA values that may cause issues. There are a number of ways we could save this data now either as a text file or in its current form as a data frame in a .RData format.

  • Write out to a delimited file: write_delim(), write_csv(), write_tsv(), write_excel_csv()
  • Write out to a file, line by line: write_lines()
  • Save an object to a .Rdata file: save()
  • Load an object from a .Rdata file: load()

Let’s try some of those methods now.

# Check the files names we currently have
print(dir("./data/"))
## [1] "Sunshine_linePlot_facet.png"        "sunshineList_subset_numID_wide.tsv"
## [3] "sunshineListLong.RData"             "sunshineListLong.tsv"
# Write sunshineFinal.df to a tab-delimited file
...(sunshineFinal.df, file = "./data/sunshineListLong.tsv")
## Error in ...(sunshineFinal.df, file = "./data/sunshineListLong.tsv"): could not find function "..."
# Check our file names after writing
print(dir("./data/"))
## [1] "Sunshine_linePlot_facet.png"        "sunshineList_subset_numID_wide.tsv"
## [3] "sunshineListLong.RData"             "sunshineListLong.tsv"
# Save our data frame as an object
save(sunshineFinal.df, file="./data/sunshineListLong.RData")

# Check our file names after saving
print(dir("./data/"))
## [1] "Sunshine_linePlot_facet.png"        "sunshineList_subset_numID_wide.tsv"
## [3] "sunshineListLong.RData"             "sunshineListLong.tsv"

3.3.0.1 readxl and writexl packages for working with excel spreadsheets

Not all of your data may come as a comma- or tab-delimited format. In the case of excel spreadsheets there are some packages available that can also facilitate the parsing of these more complex files. The readxl package is part of the tidyverse but writexl package is not. There are other means of writing to an excel file format but they are dependent on other programs (like Java or Excel) or their drivers.

From the readxl package

  • Get a list of sheet names from a file: excel_sheets()
  • Read in an excel sheet: read_excel()

From the writexl package (not a part of the tidyverse) but independent of Java and Excel

  • Write out to xlsx format: write_xlsx()
  • Can write a list of objects to separate sheets but cannot append to pre-existing files.

4.0.0 Exploratory data analysis using through summary and plotting

We now have our data in a tidy format - every row is an observation and every column is a variable. While we only have a few numeric data points that are available for summary, we can actually generate quite a few bits of summary information. We’ll do this initially using data summary tables by generating grouped data frames.

4.1.0 Use group_by() to implicitly subset your data

The simplest way to subset your data for analysis is with the group_by() verb. You can specify which variables you’d like to use and it will automatically generate any pre-existing groups that meet your criteria. While it is not necessary for the variables you want to subgroup by to be factor datatypes, it can simplify things since you can quickly calculate how many maximum combinations might exist.

Once the data is grouped, we can use summarise() to create basic summaries on some of the variables. To help us focus, we’ll try to answer a few simple questions:

  1. What was the mean salary for each year? Which year had the highest mean salary?
  2. Within each year, based on sector, what was the highest mean salary; the largest group; the highest total spent in salary?
  3. Which sector(s) are historically the largest in terms of size?

4.1.1 What was the mean salary for each year and which year was highest?

We can approach this question by recognizing that we want to group our data by year to start with. We are interested in mean salary AND knowing which year had the highest mean salary. To break our analysis into steps we would:

  1. Group by year
  2. Calculate mean salary
  3. Sort our data by descending mean salary
# Pass along the data for grouping
sunshineFinal.df %>% 
  ... %>% 
  summarise(total = ...,                       # Calculate group size
            meanSalary = ...) %>%     # Calculate mean salary for the group
  arrange(desc(meanSalary))                    # sort our data
## Error in arrange(., desc(meanSalary)): '...' used in an incorrect context

So 2006 had our highest mean salary. What would it look like if we plotted this data?

4.1.2 Within each year, what was the highest mean salary, the largest group or the highest total spent in salary?

We can’t answer these all right away but we can approach it with the same idea. Always start with a plan!

  1. Group your data by year and sector
  2. Calculate mean salary, group size, and total salary spent
  3. Reanalyse and isolate within each group the values we are looking for
# Pass along the data for grouping
sunshineFinal.df %>% 
  # Group the data by year and sector
  group_by(year, sector) %>% 
  summarise(total = n(),                       # Calculate group size
            meanSalary = mean(salary),         # Calculate mean salary for the group
            totalSalary = ...) %>%     # Calculate the total salary for the group
  
  # regroup the data just by year
  group_by(...) %>%
  # Recalculate the max values in each group
  summarise(maxGroupSize = ...(total),
            maxMean = ...(meanSalary),
            maxTotalSalary = ...(totalSalary))
## Error in summarise(., maxGroupSize = ...(total), maxMean = ...(meanSalary), : '...' used in an incorrect context

4.1.3 Which groups are historically the largest?

This one can seem a little tricky to work out but it’s really a variant of our previous question. Instead of grouping a second time by year, however, we can group by sector to analyse the historical data for each sector over our timespan.

# Pass along the data for grouping
sunshineFinal.df %>% 
  # Group the data by year and sector
  group_by(year, sector) %>% 
  summarise(total = n()) %>%                       # Calculate group size
  
  # Regroup by sector only
  group_by(...) %>%
  
  # Summarise based on each sector as a group
  summarise(maxGroupSize = max(total),
            meanGroupSize = mean(total),
            stdevGroupSize = sd(total)) %>% 
  
  # Rearrange the data based on the biggest group size
  arrange(desc(maxGroupSize))
## Error in summarise(., maxGroupSize = max(total), meanGroupSize = mean(total), : '...' used in an incorrect context

Well, it looks like School Boards have had the highest overall group size over the past 28 years, however, their yearly mean size isn’t quite as large as that of municipalities and services. What we are lacking, however, is the ability to easily see things like trends over time which could tell us if, for instance, School Board-based civil servants are increasing in size over time or shrinking, etc!

4.2.0 Simple graphical analysis of data with ggplot2

While we were able to quickly obtain some cursory information using a group_by() and summarize() approach, it can be hard to dig through the rows and rows of observations in our data. We went to all that trouble to put our data into a tidy format, not just for summarizing but also for ease of visualization! While we will go into ggplot2 in much greater depth during lectures 2 and 3, let’s begin our journey now with a little bit of the basics.

We can begin with some initial analyses of the data using the ggplot2 package. It has all of the components we need to help us decide on which data we want to focus on or keep. There are a number of ways to visualize our data and here we will refresh our ggplot skills.

Basic ggplot notes:

  • ggplot objects hold a complex number of attributes but always need an initial source of data
  • ggplot objects can be modified with the + symbol by adding in layers
    • layers can alter attributes such as which data is displayed and how.
    • Most layers can be modified in one way or another.
  • ggplot objects can be plotted, saved, and passed around.

We’ll begin with a trimmed down dataset where we will filter out the “Ministry:” based entries. Remember, these were originally called “Seconded” sectors and will only take a small part of our dataset. That’s right, it’s easy to filter your data on-the-fly before passing it on to ggplot!

As we start to produce plot figures, they’ll vary in size depending on your needs. In an R Markdown code cell, you can set your figure size using the code cell attributes much like the parameters of a function. You can set the figure size dimensions using fig.width and fig.height. As we proceed in the future, you’ll see us setting these attributes within our code cells.

# Initialize a plot with our summarized data
sunshine.plot <- 
  # Pass the original data
  sunshineFinal.df %>% 
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>% 
  # Group and summarise the data         
  group_by(year, sector) %>% 
  summarise(total = n(),                       # Calculate group size
            meanSalary = mean(salary),         # Calculate mean salary for the group   
            totalSalary = sum(salary)) %>%     # Total salary spent on the group
  
  ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a quick look at the structure of the data
str(sunshine.plot)
## Error in str(sunshine.plot): object 'sunshine.plot' not found

4.2.1 Make a line graph of mean salary for each sector across our time period

We now have a basic plot object initialized but we need to tell it how to display the data associated with it. We’ll begin with a simple line graph of mean salary for all sectors across all dates within the set.

In order to update or add layers to a ggplot object, we can use the + symbol for each command. For instance, to define the source of x-axis and y-axis data, we use aes() command to update the aesthetics layer. Remember how we defined the sector variable as a factor? We’ll take advantage of that here and tell ggplot to give each sector it’s own colour.

After defining our aesthetics, we still need to tell ggplot how to actually graph the data. The ggplot package comes with an abundance of visualizations accessed through the geom_*() commands. Some examples include

  • geom_point() for scatterplots
  • geom_line() for line graphs
  • geom_boxplot() for boxplots
  • geom_violin() for violin plots
  • geom_bar() for bargraphs
  • geom_histogram() for histograms
# Update the aesthetics with axis and colour information, then add a line graph!
sunshine.plot +

    # 2. Aesthetics
    aes(x = ..., y = ..., colour = ...) +
    theme(text = element_text(size = 20)) + # set text size
  
    # Give titles to your axes
    guides(colour = guide_legend(title="Sector")) + # Legend title
    xlab("Year") + # Set the x-axis label
    ylab("Mean Salary") + # Set the y-axis label

    # 4. Geoms
    geom_line()
## Error in eval(expr, envir, enclos): object 'sunshine.plot' not found

4.2.2 Use the facet_wrap() command to break Sectors into separate graphs

There’s a lot of data on that graph and some of it is quite drowned out because of the scale of some Sectors with much higher salaries. To break out each sector individually, we can add the facet_wrap() command. We’ll also update some of the parameters:

  • scale: we will update this so each y-axis scale is determined by PHU-specific data.
  • ncol: use this to set the number of columns displayed in our grid

At the same time, we’ll also get rid of the legend since each individual graph will be labeled by its sector.

# Add a facet_wrap and get rid of the legend
sunshine_facet.plot <- sunshine.plot +

    # 2. Aesthetics
    aes(x = year, y = meanSalary, colour = sector) +
    theme(text = element_text(size = 14)) + # set text size
  
    # Give titles to your axes
    xlab("Year") + # Set the x-axis label
    ylab("Mean Salary") + # Set the y-axis label
    ggtitle("Mean sunshine salary per year across sectors") +
  
    # Remove the legend
    theme(legend.position = "none") +

    # 4. Geoms
    geom_line() +

    # 7. ### 4.2.2 Facet our data by sector
    facet_wrap(~ ..., scales = ..., ncol=...)
## Error in eval(expr, envir, enclos): object 'sunshine.plot' not found
# Display our plot
sunshine_facet.plot
## Error in eval(expr, envir, enclos): object 'sunshine_facet.plot' not found

4.3.0 Use the ggsave() command to save your plots to a file

There are a number of ways you can use the ggsave() command to specify how you want to save your files.

# What is our working directory?
getwd()
## [1] "C:/Users/mokca/Dropbox/!CAGEF/Course_Materials/Advanced_Graphics_in_R/2025.03_Adv_Graphics_R/Lecture_01_R_Introduction"
# Save the plot we've generated to the root directory of the lecture files.
ggsave(..., 
       filename = "data/Sunshine_linePlot_facet.png", 
       scale=2, 
       device = "png", 
       units = c("cm"), width = 20, height = 30)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# Take a look at the directory
dir("data/")
## [1] "Sunshine_linePlot_facet.png"        "sunshineList_subset_numID_wide.tsv"
## [3] "sunshineListLong.RData"             "sunshineListLong.tsv"

4.4.0 Barplots can be used to summarize your data across sectors

Although we do have a running total for each year, what if we want to look at the totals individuals across our sectors? Using a barplot we can stack sectors by year and get a sense of yearly totals individuals by sector.

This time we will use geom_bar() to display our data and tell it to use the values from our total variable in our data to generate the totals. We do this by setting the stat = "identity" parameter.

sunshine.plot +
  # 2. Aesthetics
  aes(x = year, y= total, fill = ...) + # set our fill colour instead of line colour
  
  theme(text = element_text(size = 14)) + # set text size
  guides(fill = guide_legend(title="Sector")) +
  
  # Give titles to your axes
  xlab("Year") + # Set the x-axis label
  ylab("Total Individuals") + # Set the y-axis label
  ggtitle("Yearly breakdown of servants by sector") +
  
  # Set up our barplot here
  geom_bar(...) 
## Error in eval(expr, envir, enclos): object 'sunshine.plot' not found

Look like our number of public servants with salaries above $100K is rising year-by-year! That should be a good thing! Going back to our third question from section 4.1.3 we can see that visually, the school boards sector in recent years has had the most employees on the Sunshine list but in the years before that, Municipalities and Services tended to be the larger group.

4.4.1 Change your data by updating your axis variables!

Returning back to our question, how does total salary payout look between our various sectors? We can quickly change up our graph parameters so that we are viewing totalSalary instead. We just need to set our y-axis properly.

sunshine.plot +
  # 2. Aesthetics
  aes(x = year, y= ..., fill = sector) + # set our fill colour instead of line colour
  
  theme(text = element_text(size = 14)) + # set text size
  guides(fill = guide_legend(title="Sector")) +
  
  # Give titles to your axes
  xlab("Year") + # Set the x-axis label
  ylab("Total Salary Paid") + # Set the y-axis label
  ggtitle("Yearly breakdown of total salary paid by sector") +
  
  # Set up our barplot here
  geom_bar(stat = "identity") 
## Error in eval(expr, envir, enclos): object 'sunshine.plot' not found

It looks nearly identical to our breakdown of size. This is actually pretty good to see as it suggests that salaries in these groups tends to be very similar. You would need to do more in-depth analyses BUT we can leave that for your assignment.

It would also be good to determine more clearly, however, what percentage of each year the various sectors comprise but we’ll save that for next week.


4.4.2 View individual datapoints with geom_point()

Before we wrap up, let’s take a closer look at our data by zooming in on a single year. We’ll filter our data to look at data from 2023 and then plot all of the salaries as single datapoint, categorized by sector.

Using the geom_point() layer, we’ll be able to plot each observation in our dataset. The resulting visualization would be considered a strip-plot rather than a standard scatterplot or biplot.

Note that we are also accessing the theme() layer here to adjust parts of our plot. We’ll spend most of lecture 03 learning to manipulate many of our smaller details. For now, we’ll adjust our x-axis text so it is at a 45-degree angle.

sunshineFinal.df %>% 
  # Filter the data by year
  filter(year == ...,
         str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>% 
  
  ggplot() +
  
  # 2. Aesthetics
  aes(x = sector, y = salary, colour = sector) +
  
  # Remove the legend
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + # rotate our x-axis text to 45 degrees
  
  # Give titles to your axes
  xlab("Sector") + # Set the x-axis label
  ylab("Salary Paid") + # Set the y-axis label
  ggtitle("2024 breakdown of salaries paid by sector") +
  
  # 4. Geoms
  ...
## Error in `filter()`:
## i In argument: `year == ...`.
## Caused by error:
## ! '...' used in an incorrect context

Wow, some folks at Ontario Power Generation are making a LOT of money! That’s a lot of pay for steering a company that doesn’t have many competitors within the province! From our figure we get a sense of the pay range in each sector, although we can’t properly see the full distribution of our sectors. We’ll work on that in the coming weeks.


5.0.0 Class summary

That’s our first class! If we’ve made it this far, we’ve reviewed

  1. Foundational concepts in R
  2. Helpful functions in generating tidy data for analysis
  3. Basics of visualizations using the ggplot2

We took a “messy” dataset from the Ontario government and created a tidy data set that we were able to visualize. We also took the time to summarize our data based on specific groups to get a better picture of how salaries are distributed across sectors and over time.

Next week? Getting deeper into ggplot2!


5.1.0 Weekly assignment

This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown
Code 50% - Does it follow best practices?
- Does it make good use of available packages?
- Was data prepared properly
Answers and Output 50% - Is output based on the correct dataset?
- Are groupings appropriate
- Are correct titles/axes/legends correct?
- Is interpretation of the graphs correct?

Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.

You can save and download the markdown notebook in its native format. Submit this file to the the appropriate assignment section by 12:59 pm on the date of our next class: March 14th, 2024.


5.2.0 Acknowledgements

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 3.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2025 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


6.0.0 Appendix 1: Instructions for installing your own software

6.1.0 R and RStudio

6.1.1 Installing R

As of 2022-03-01, the latest stable R version is 4.4.3:

Windows:
- Go to http://cran.utstat.utoronto.ca/
- Click on ‘Download R for Windows’
- Click on ‘install R for the first time’
- Click on ‘Download R 4.4.3 for Windows’ (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the instructions.

(Mac) OS X:
- Go to http://cran.utstat.utoronto.ca/
- Click on ‘Download R for (Mac) OS X’
- Click on R-4.4.3 .pkg (or a newer version)
- Open the .pkg file once it has downloaded and follow the instructions.


Linux:
- Open a terminal (Ctrl + alt + t) - sudo apt-get update
- sudo apt-get install r-base
- sudo apt-get install r-base-dev (so you can compile packages from source)


6.1.2 Installing RStudio

As of 2025-03-05, the latest RStudio version is 2024.12.1+563 (released 2025-02-13)

Windows (10/11):
- Go to https://posit.co/downloads/
- Click on ‘RSTUDIO-2024.12.1-563.EXE’ to download the installer (or a newer version)
- Double-click on the .exe file once it has downloaded and follow the instructions.

(Mac) OS X (11+):
- Go to https://posit.co/downloads/
- Click on ‘RSTUDIO-2024.12.1-563.DMG’ to download the installer (or a newer version)
- Double-click on the .dmg file once it has downloaded and follow the instructions.


Linux:
- Go to https://posit.co/downloads/
- Click on the installer that describes your Linux distribution, e.g. ‘RSTUDIO-2022.12.0-353-AMD64.DEB’ (or a newer version)
- Double-click on the .deb file once it has downloaded and follow the instructions.
- If double-clicking on your .deb file did not open the software manager, open the terminal (Ctrl + alt + t) and type sudo dpkg -i /path/to/installer/RSTUDIO-2024.12.1-563-AMD64.deb

 _Note: You have 3 things that could change in this last command._     
 1. This assumes you have just opened the terminal and are in your home directory. (If not, you have to modify your path. You can get to your home directory by typing cd ~.)     
 2. This assumes you have downloaded the .deb file to Downloads. (If you downloaded the file somewhere else, you have to change the path to the file, or download the .deb file to Downloads).      
 3. This assumes your file name for .deb is the same as above. (Put the name matching the .deb file you downloaded).

If you have a problem with installing R or RStudio, you can also try to solve the problem yourself by Googling any error messages you get. You can also try to get in touch with me or the course TAs.


6.1.3 Getting to know the RStudio environment

RStudio is an IDE (Integrated Development Environment) for R that provides a more user-friendly experience than using R in a terminal setting. It has 4 main areas or panes, which you can customize to some extent under Tools > Global Options > Pane Layout:

  1. Source - The code you are annotating and keeping in your script.
  2. Console - Where your code is executed.
  3. Environment - What global objects you have created and functions you have written/sourced.
    History - A record of all the code you have executed in the console.
    Connections - Which data sources you are connecting to. (Not being used in this course.)
  4. Files, Plots, Packages, Help, Viewer - self-explanatoryish if you click on their tabs.

All of the panes can be minimized or maximized using the large and small box outlines in the top right of each pane.

6.1.3.1 Source

The Source is where you are keeping the code and annotation that you want to be saved as your script. The tab at the top left of the pane has your script name (i.e. ‘Untitled.R’), and you can switch between scripts by toggling the tabs. You can save, search or publish your source code using the buttons along the pane header. Code in the Source pane is run or executed automatically.

To run your current line of code or a highlighted segment of code from the Source pane you can:
a) click the button 'Run' -> 'Run Selected Line(s)',
b) click 'Code' -> 'Run Selected Line(s)' from the menu bar,
c) use the keyboard shortcut CTRL + ENTER (Windows & Linux) Command + ENTER (Mac) (recommended),
d) copy and paste your code into the Console and hit Enter (not recommended).

There are always many ways to do things in R, but the fastest way will always be the option that keeps your hands on the keyboard.

6.1.3.2 Console

You can also type and execute your code (by hitting ENTER) in the Console when the > prompt is visible. If you enter code and you see a + instead of a prompt, R doesn’t think you are finished entering code (i.e. you might be missing a bracket). If this isn’t immediately fixable, you can hit Esc twice to get back to your prompt. Using the up and down arrow keys, you can find previous commands in the Console if you want to rerun code or fix an error resulting from a typo.

On the Console tab in the top left of that pane is your current working directory. Pressing the arrow next to your working directory will open your current folder in the Files pane. If you find your Console is getting too cluttered, selecting the broom icon in that pane will clear it for you. The Console also shows information: upon start up about R (such as version number), during the installation of packages, when there are warnings, and when there are errors.

6.1.3.3 Environment

In the Global Environment you can see all of the stored objects you have created or sourced (imported from another script). The Global Environment can become cluttered, so it also has a broom button to clear its workspace.

Objects are made by using the assignment operator <-. On the left side of the arrow, you have the name of your object. On the right side you have what you are assigning to that object. In this sense, you can think of an object as a container. The container holds the values given as well as information about ‘class’ and ‘methods’ (which we will come back to).

Type x <- c(2,4) in the Console followed by Enter. 1D objects’ data types can be seen immediately as well as their first few values. Now type y <- data.frame(numbers = c(1,2,3), letters = c("a","b","c")) in the Console followed by Enter. You can immediately see the dimension of 2D objects, and you can check the structure of data frames and lists (more later) by clicking on the object’s arrow. Clicking on the object name will open the object to view in a new tab. Custom functions created in session or sourced will also appear in this pane.

The Environment pane dropdown displays all of the currently loaded packages in addition to the Global Environment. Loaded means that all of the tools/functions in the package are available for use. R comes with a number of packages pre-loaded (i.e. base, grDevices).

In the History tab are all of the commands you have executed in the Console during your session. You can select a line of code and send it to the Source or Console.

The Connections tab is to connect to data sources such as Spark and will not be used in this lesson.

6.1.3.4 Files, Plots, Packages, Help, Viewer

The Files tab allows you to search through directories; you can go to or set your working directory by making the appropriate selection under the More (blue gear) drop-down menu. The ... to the top left of the pane allows you to search for a folder in a more traditional manner.

The Plots tab is where plots you make in a .R script will appear (notebooks and markdown plots will be shown in the Source pane). There is the option to Export and save these plots manually.

The Packages tab has all of the packages that are installed and their versions, and buttons to Install or Update packages. A check mark in the box next to the package means that the package is loaded. You can load a package by adding a check mark next to a package, however it is good practice to instead load the package in your script to aid in reproducibility.

The Help menu has the documentation for all packages and functions. For each function you will find a description of what the function does, the arguments it takes, what the function does to the inputs (details), what it outputs, and an example. Some of the help documentation is difficult to read or less than comprehensive, in which case goggling the function is a good idea.

The Viewer will display vignettes, or local web content such as a Shiny app, interactive graphs, or a rendered html document.

6.1.3.5 Global Options

I suggest you take a look at Tools -> Global Options to customize your experience.

For example, under Code -> Editing I have selected Soft-wrap R source files followed by Apply so that my text will wrap by itself when I am typing and not create a long line of text.

You may also want to change the Appearance of your code. I like the RStudio theme: Modern and Editor font: Ubuntu Mono, but pick whatever you like! Again, you need to hit Apply to make changes.

That whirlwind tour isn’t everything the IDE can do, but it is enough to get started.


The Center for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.